{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Text Classification using Native Tensorflow Pre-processing</h1>\n",
    "\n",
    "This notebook continues from <a href=\"text_classification.ipynb\">text_classification.ipynb</a> -- in particular, we assume that you have run the \"Create Dataset from BigQuery\" and \"Rerun with Pre-trained embedding\" sections of that notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml'\n",
    "PROJECT = 'cloud-training-demos'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION\n",
    "os.environ['TFVERSION'] = '1.8'\n",
    "\n",
    "if 'COLAB_GPU' in os.environ:  # this is always set on Colab, the value is 0 or 1 depending on whether a GPU is attached\n",
    "  from google.colab import auth\n",
    "  auth.authenticate_user()\n",
    "  # download \"sidecar files\" since on Colab, this notebook will be on Drive\n",
    "  !rm -rf txtclsmodel\n",
    "  !git clone --depth 1 https://github.com/GoogleCloudPlatform/training-data-analyst\n",
    "  !mv  training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtclsmodel/ .\n",
    "  !rm -rf training-data-analyst\n",
    "  # downgrade TensorFlow to the version this notebook has been tested with\n",
    "  !pip install --upgrade tensorflow==$TFVERSION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-requisites\n",
    "Ensure you have the data files and pre-trained embedding file in your GCS bucket:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud storage ls gs://$BUCKET/txtcls/eval.tsv\n",
    "gcloud storage ls gs://$BUCKET/txtcls/train.tsv\n",
    "gcloud storage ls gs://$BUCKET/txtcls/glove.6B.200d.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you don't, go back and run the <a href=\"text_classification.ipynb\">text_classification</a> notebook first. Specifically, run the \"Create Dataset from BigQuery\" and \"Rerun with Pre-trained embedding\" sections of that notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Native Tensorflow Predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Why Native?\n",
    "\n",
    "Up until now we've been using python functions to do our data pre-processing. This is fine during training, but during serving it adds the limitation that we need a python client in the prediction pipeline.\n",
    "\n",
    "This limits your serving flexibility. For example, lets say you want to be able to serve this model locally (offline) on a mobile phone. How would you do it? It's non trivial to execute python code on Android. Plus if your vocabulary tokenization ever changed you'd need to update all the clients or else they would be out of sync with the model.\n",
    "\n",
    "A better way would be to have all of our serving pre-processing to be done using native Tensorflow operations. This way we can take advantage of Tensorflow's hardware agnostic execution engine, and leverage the engineering efforts the Tensorflow team put into making sure our code works whether we're running on a server, mobile, or an embedded device!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TensorFlow/Keras Code\n",
    "\n",
    "Please explore the code in this <a href=\"txtclsmodel/trainer\">directory</a>: `model_native.py` contains the TensorFlow model and `task.py` parses command line arguments and launches off the training job. \n",
    "\n",
    "In particular look for the follwing:\n",
    "\n",
    "1. [tf.keras.preprocessing.text.Tokenizer.fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) to generate a mapping from our word vocabulary to integers\n",
    "2. [tf.gfile](https://www.tensorflow.org/api_docs/python/tf/gfile/GFile) to write the vocabulary mapping to disk\n",
    "3. [tf.contrib.lookup.index_table_from_file()](https://www.tensorflow.org/api_docs/python/tf/contrib/lookup/index_table_from_file) to encode our sentences into a tensor of their respective word-integers, based on the vocabulary mapping written to disk in the previous step\n",
    "4. [tf.pad](https://www.tensorflow.org/api_docs/python/tf/pad) to pad all sequences to be the same length"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train on the Cloud\n",
    "\n",
    "Note the new `--native` parameter. This tells `task.py` to call `model_native.py` instead of `model.py`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "OUTDIR=gs://${BUCKET}/txtcls/trained_finetune_native\n",
    "JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",
    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    " --region=$REGION \\\n",
    " --module-name=trainer.task \\\n",
    " --package-path=${PWD}/txtclsmodel/trainer \\\n",
    " --job-dir=$OUTDIR \\\n",
    " --scale-tier=BASIC_GPU \\\n",
    " --runtime-version=$TFVERSION \\\n",
    " -- \\\n",
    " --output_dir=$OUTDIR \\\n",
    " --train_data_path=gs://${BUCKET}/txtcls/train.tsv \\\n",
    " --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \\\n",
    " --embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt \\\n",
    " --native \\\n",
    " --num_epochs=5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy trained model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "MODEL_NAME=\"txtcls\"\n",
    "MODEL_VERSION=\"v1_finetune_native\"\n",
    "MODEL_LOCATION=$(gcloud storage ls gs://${BUCKET}/txtcls/trained_finetune_native/export/exporter/ | tail -1)\n",
    "#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME} --quiet\n",
    "#gcloud ml-engine models delete ${MODEL_NAME}\n",
    "#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION\n",
    "gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sample prediction instances\n",
    "Here are some actual hacker news headlines gathered from July 2018. These titles were not part of the training or evaluation datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "techcrunch=[\n",
    "  'Uber shuts down self-driving trucks unit',\n",
    "  'Grover raises €37M Series A to offer latest tech products as a subscription',\n",
    "  'Tech companies can now bid on the Pentagon’s $10B cloud contract'\n",
    "]\n",
    "nytimes=[\n",
    "  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',\n",
    "  'A $3B Plan to Turn Hoover Dam into a Giant Battery',\n",
    "  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'\n",
    "]\n",
    "github=[\n",
    "  'Show HN: Moon – 3kb JavaScript UI compiler',\n",
    "  'Show HN: Hello, a CLI tool for managing social media',\n",
    "  'Firefox Nightly added support for time-travel debugging'\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Predictions\n",
    "Note how we can now feed the titles directly to the model! No need for the client to load a vocabulary tokenization mapping. The text tokenization is done for us inside of the Tensorflow model's serving_input_fn. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from googleapiclient import discovery\n",
    "from oauth2client.client import GoogleCredentials\n",
    "import json\n",
    "\n",
    "# JSON format the requests\n",
    "requests = (techcrunch+nytimes+github)\n",
    "requests = [x.lower() for x in requests] \n",
    "request_data = {'instances': requests}\n",
    "\n",
    "# Authenticate and call CMLE prediction API \n",
    "credentials = GoogleCredentials.get_application_default()\n",
    "api = discovery.build('ml', 'v1', credentials=credentials,\n",
    "            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')\n",
    "\n",
    "parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1_finetune_native')\n",
    "response = api.projects().predict(body=request_data, name=parent).execute()\n",
    "\n",
    "# Format and print response\n",
    "for i in range(len(requests)):\n",
    "  print('\\n{}'.format(requests[i]))\n",
    "  print(' github    : {}'.format(response['predictions'][i]['dense_1'][0]))\n",
    "  print(' nytimes   : {}'.format(response['predictions'][i]['dense_1'][1]))\n",
    "  print(' techcrunch: {}'.format(response['predictions'][i]['dense_1'][2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### References\n",
    "- This implementation is based on code from: https://github.com/google/eng-edu/tree/master/ml/guides/text_classification.\n",
    "- See the full text classification tutorial at: https://developers.google.com/machine-learning/guides/text-classification/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
